The Data Set

The data used in this lab come from OpenML. In particular, we use the Phoneme data set, whose description can be found here: https://www.openml.org/d/1489.

phoneme = read.csv("phoneme.csv", stringsAsFactors=TRUE)
head(phoneme)
##        V1      V2      V3      V4      V5 Class
## 1  1.7494 -0.0504 -0.4321  0.7681 -0.5974 Nasal
## 2  1.9682 -0.8207 -0.2183  0.0900  0.4641 Nasal
## 3 -0.1535 -0.1977  1.1861  0.1394 -0.6372  Oral
## 4 -0.3297 -0.2271  0.3609 -1.5044 -1.9235  Oral
## 5  2.6405 -0.7292 -1.0152 -0.3963 -0.2138 Nasal
## 6 -0.3896 -0.1427  1.1188  0.7823  1.6400  Oral

The data set has 5 numeric predictors and one response variable, Class, which has two levels: Nasal (nasal vowels) and Oral (oral vowels). The numeric variables have all been standardised to have mean 0 and standard deviation 1 (and then rounded to 4 d.p.), as in the original data set. A pairwise scatterplot of the data looks like the following:

with(phoneme, pairs(phoneme[,-6], col=c(2,4)[Class], pch=c(1,3)[Class]))
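Since the predictors are described as standardised, a quick sanity check can be run (a small sketch; because of the 4 d.p. rounding, the means and standard deviations will only be approximately 0 and 1):

```r
# Verify the claimed standardisation of the five predictors:
# column means should be ~0 and standard deviations ~1,
# up to the rounding to 4 d.p.
round(colMeans(phoneme[, -6]), 3)
round(apply(phoneme[, -6], 2, sd), 3)
```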

Library needed for this lab:

library(mclust)
## Package 'mclust' version 5.4.7
## Type 'citation("mclust")' for citing this R package in publications.

Tasks

Visualisation

1. When there are a large number of observations, observations of smaller classes may be overwhelmed in a scatterplot by those in larger classes. Therefore, one may prefer to plot the observations of larger classes first and then those of smaller ones, or for a two-class data set those of the majority class and then those of the minority one.

Write R code to reproduce the following plot, which follows the above description. Make sure that the majority class is determined automatically by code (i.e., not by eye) so that the code can be reused easily.

table(phoneme$Class)
## 
## Nasal  Oral 
##   709   291
maj <- names(which.max(table(phoneme$Class)))    # majority class, determined by code
phoneme_nasal <- phoneme[phoneme$Class=="Nasal",]
phoneme_oral <- phoneme[phoneme$Class=="Oral",]
  • the majority class is Nasal (709 observations)
  • the minority class is Oral (291 observations)
n_maj = sum(phoneme$Class == maj)                # plot the majority class first
da = rbind(phoneme[phoneme$Class == maj, -6], phoneme[phoneme$Class != maj, -6])
pairs(da, col = c(rep(2, n_maj), rep(4, nrow(da) - n_maj)),
      pch = c(rep(1, n_maj), rep(3, nrow(da) - n_maj)))

Univariate Density Estimation

2. Plot the kernel density estimate for V1, superimposed on a histogram (with breaks=100), for each of the bandwidth values chosen by the methods nrd0, ucv, bcv and SJ, respectively.

par(mfrow=c(2,2))
for(bw in c("nrd0", "ucv", "bcv", "SJ")) {
   hist(phoneme$V1, freq=FALSE, breaks=100, 
        col="gray80", border="white",
        main=paste0("h = ", bw), xlab="V1")
   lines(density(phoneme$V1, bw=bw), col=4, lwd=2)
}

Observe and discuss if both overfitting and underfitting can exist for a KDE.

  • underfitting for nrd0: the estimate is oversmoothed and does not capture the full height of the peak near −0.7.
  • overfitting for ucv: the estimate is undersmoothed and too variable, chasing noise in the data.
  • the best of the KDEs: bcv or SJ.

So both underfitting and overfitting can indeed occur for a KDE, depending on the bandwidth chosen.
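To see how far apart the four selectors actually are, the chosen bandwidth values can be printed directly (a small sketch using the base-R selectors behind `density()`; `bw.ucv` and `bw.bcv` may warn that the minimum occurs at a boundary of the search interval):

```r
# Bandwidths chosen by the four selectors for V1: a smaller value
# gives a rougher (more variable) estimate, a larger one a smoother fit
sapply(list(nrd0 = bw.nrd0, ucv = bw.ucv, bcv = bw.bcv, SJ = bw.SJ),
       function(f) f(phoneme$V1))
```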

3. Find the best normal mixture fit to V1 (according to the BIC), in both equal and varying variance subfamilies, with the number of components ranging from 1 to 20.

(r = densityMclust(phoneme$V1, G=1:20, modelNames=c("E","V")))    # Let BIC decide
## 'densityMclust' model object: (V,6) 
## 
## Available components: 
##  [1] "call"           "data"           "modelName"      "n"             
##  [5] "d"              "G"              "BIC"            "loglik"        
##  [9] "df"             "bic"            "icl"            "hypvol"        
## [13] "parameters"     "z"              "classification" "uncertainty"   
## [17] "density"
summary(r)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust V (univariate, unequal variance) model with 6 components: 
## 
##  log-likelihood    n df       BIC       ICL
##       -939.6717 1000 17 -1996.775 -2467.617
  • the best normal mixture fit to V1 (according to the BIC): (V,6)

Plot the fitted density, along with a histogram of the data.

plot(r, phoneme$V1, "density", breaks=100, lwd=2, col=2) 

Does it look like a better or worse fit than the best of the KDEs?

  • the best of the KDEs: bcv or SJ
  • it looks like a worse fit than the best of the KDEs: the mixture density appears oversmoothed (underfitting), and the height of the main peak is not captured accurately.

Multivariate density estimation

4. For each of the two classes, find a density estimate in the equal variance subfamily of multivariate normal mixtures, with the number of components ranging from 1 to 9.

(r_E1 = densityMclust(phoneme_nasal[, -6], G=1:9, modelNames=c("EEE")))    # Let BIC decide
## 'densityMclust' model object: (EEE,9) 
## 
## Available components: 
##  [1] "call"           "data"           "modelName"      "n"             
##  [5] "d"              "G"              "BIC"            "loglik"        
##  [9] "df"             "bic"            "icl"            "hypvol"        
## [13] "parameters"     "z"              "classification" "uncertainty"   
## [17] "density"
summary(r_E1)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 9
## components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -3768.155 709 68 -7982.651 -8102.045
(r_E2 = densityMclust(phoneme_oral[, -6], G=1:9, modelNames=c("EEE")))    # Let BIC decide
## 'densityMclust' model object: (EEE,8) 
## 
## Available components: 
##  [1] "call"           "data"           "modelName"      "n"             
##  [5] "d"              "G"              "BIC"            "loglik"        
##  [9] "df"             "bic"            "icl"            "hypvol"        
## [13] "parameters"     "z"              "classification" "uncertainty"   
## [17] "density"
summary(r_E2)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 8
## components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -1495.103 291 62 -3341.952 -3397.139
  • For class Nasal, the best density estimate in the equal variance subfamily of multivariate normal mixtures, with the number of components ranging from 1 to 9 (according to the BIC): (EEE,9)

  • For class Oral, the best density estimate in the equal variance subfamily of multivariate normal mixtures, with the number of components ranging from 1 to 9 (according to the BIC): (EEE,8)

Show each density in a pairwise plot of the data.

plot(r_E1, phoneme_nasal[, -6], what="density", col=4, points.col="grey")

plot(r_E2, phoneme_oral[, -6], what="density", col=4, points.col="grey")

5. Repeat Task 4, but for the varying variance subfamily of normal mixtures.

(r_V1 = densityMclust(phoneme_nasal[, -6], G=1:9, modelNames=c("VVV")))    # Let BIC decide
## 'densityMclust' model object: (VVV,9) 
## 
## Available components: 
##  [1] "call"           "data"           "modelName"      "n"             
##  [5] "d"              "G"              "BIC"            "loglik"        
##  [9] "df"             "bic"            "icl"            "hypvol"        
## [13] "parameters"     "z"              "classification" "uncertainty"   
## [17] "density"
summary(r_V1)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 9
## components: 
## 
##  log-likelihood   n  df       BIC       ICL
##       -2507.006 709 188 -6248.017 -6286.289
(r_V2 = densityMclust(phoneme_oral[, -6], G=1:9, modelNames=c("VVV")))    # Let BIC decide
## 'densityMclust' model object: (VVV,3) 
## 
## Available components: 
##  [1] "call"           "data"           "modelName"      "n"             
##  [5] "d"              "G"              "BIC"            "loglik"        
##  [9] "df"             "bic"            "icl"            "hypvol"        
## [13] "parameters"     "z"              "classification" "uncertainty"   
## [17] "density"
summary(r_V2)
## ------------------------------------------------------- 
## Density estimation via Gaussian finite mixture modeling 
## ------------------------------------------------------- 
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3
## components: 
## 
##  log-likelihood   n df       BIC       ICL
##       -1453.461 291 62 -3258.669 -3296.499
  • For class Nasal, the best density estimate in the varying variance subfamily of multivariate normal mixtures, with the number of components ranging from 1 to 9 (according to the BIC): (VVV,9)

  • For class Oral, the best density estimate in the varying variance subfamily of multivariate normal mixtures, with the number of components ranging from 1 to 9 (according to the BIC): (VVV,3)

Show each density in a pairwise plot of the data.

plot(r_V1, phoneme_nasal[, -6], what="density", col=4, points.col="grey")

plot(r_V2, phoneme_oral[, -6], what="density", col=4, points.col="grey")

6. By applying the fundamental rule of classification, we immediately obtain a classifier from each of Tasks 4 and 5. For each classifier, compute its resubstitution misclassification rate.

equal variance subfamily of normal mixtures (classifier from Task 4)

p_E1 <- predict(r_E1, phoneme[, -6])
head(p_E1)
## [1] 1.945426e-03 3.735163e-03 5.198055e-03 2.702895e-05 2.643086e-02
## [6] 3.356227e-04
p_E2 <- predict(r_E2, phoneme[, -6])
head(p_E2)
## [1] 1.923916e-13 9.998861e-16 1.325117e-02 2.587213e-04 1.289425e-22
## [6] 3.095459e-02
# assign each observation to the class with the larger estimated density
yhat_E <- ifelse(p_E1>p_E2, 1, 2) 
head(yhat_E) # 1=Nasal, 2=Oral
## [1] 1 1 2 2 1 2
phoneme2 = phoneme
phoneme2$Class <- ifelse(phoneme2$Class == "Nasal", as.integer(1), as.integer(2))
head(phoneme2)
##        V1      V2      V3      V4      V5 Class
## 1  1.7494 -0.0504 -0.4321  0.7681 -0.5974     1
## 2  1.9682 -0.8207 -0.2183  0.0900  0.4641     1
## 3 -0.1535 -0.1977  1.1861  0.1394 -0.6372     2
## 4 -0.3297 -0.2271  0.3609 -1.5044 -1.9235     2
## 5  2.6405 -0.7292 -1.0152 -0.3963 -0.2138     1
## 6 -0.3896 -0.1427  1.1188  0.7823  1.6400     2
(train_error = mean(phoneme2$Class != yhat_E))     # resubstitution misclassification rate
## [1] 0.199

“EEE” resubstitution misclassification rate: 19.9%

varying variance subfamily of normal mixtures (classifier from Task 5)

p_V1 <- predict(r_V1, phoneme[, -6])
head(p_V1)
## [1] 3.084490e-03 3.083380e-03 3.961820e-04 1.233538e-09 5.004490e-01
## [6] 1.956755e-04
p_V2 <- predict(r_V2, phoneme[, -6])
head(p_V2)
## [1] 1.022265e-07 2.254148e-09 1.383541e-02 5.228346e-05 3.060883e-14
## [6] 3.265301e-03
# assign each observation to the class with the larger estimated density
yhat_V <- ifelse(p_V1>p_V2, 1, 2) 
head(yhat_V) # 1=Nasal, 2=Oral
## [1] 1 1 2 2 1 2
(train_error = mean(phoneme2$Class != yhat_V))     # resubstitution misclassification rate
## [1] 0.164

“VVV” resubstitution misclassification rate: 16.4%

  • “VVV” has a lower resubstitution misclassification rate than “EEE”. This means “VVV” performs better than “EEE” on the training data.
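Note that the comparisons above use the raw density values, which implicitly assumes equal class priors. In general, the fundamental rule assigns an observation to the class maximising π_k f_k(x). A minimal sketch of the prior-weighted comparison, with priors estimated by the class proportions (the function name and arguments are hypothetical, not part of the lab code):

```r
# Fundamental rule with estimated priors: assign each observation to
# the class maximising pi_k * f_k(x), where pi_k = n_k / n
classify_by_rule <- function(dens1, dens2, n1, n2) {
  pi1 <- n1 / (n1 + n2)
  pi2 <- n2 / (n1 + n2)
  ifelse(pi1 * dens1 > pi2 * dens2, 1, 2)    # 1 = class 1, 2 = class 2
}
```

Here one would call, e.g., `classify_by_rule(p_V1, p_V2, 709, 291)`, which weights the Nasal densities by the larger prior before comparing, and may change some predictions near the decision boundary.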

K-means

7. With the K-means method, find the clustering results with two clusters. Show the results in a pairwise plot of the data, using different colors and point types for observations of different clusters. (If you have successfully finished Task 1, then plot the majority cluster first.)

(r_K2 = kmeans(phoneme[, -6], centers=2))     # K = 2
## K-means clustering with 2 clusters of sizes 396, 604
## 
## Cluster means:
##           V1         V2         V3         V4          V5
## 1  0.8236333  0.3351551 -0.8928588 -0.3628811 -0.12521818
## 2 -0.5399972 -0.2197373  0.5853838  0.2379156  0.08210265
## 
## Clustering vector:
##    [1] 1 1 2 2 1 2 2 1 1 1 2 1 2 2 1 2 2 1 1 1 2 2 1 2 2 2 2 2 1 2 2 1 1 1 2 1 1
##   [38] 2 2 2 2 2 2 1 1 1 2 1 1 2 1 2 2 1 1 2 2 2 2 2 1 1 1 1 2 2 2 1 2 1 2 1 2 1
##   [75] 2 1 1 1 2 1 2 1 1 2 2 2 2 1 2 2 1 2 1 2 2 2 2 2 2 2 2 1 1 1 1 2 1 1 2 1 1
##  [112] 2 2 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2
##  [149] 2 1 2 1 2 1 2 1 2 1 1 2 1 1 2 1 1 2 2 1 1 1 2 1 1 1 2 2 2 2 1 1 1 2 1 2 1
##  [186] 1 2 1 1 2 1 2 1 1 2 2 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1 1 2 1 1 2 2 2 1 1 1
##  [223] 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 2 2 1 2 1 2 2 2 2 2 2 2 2 2 1 2
##  [260] 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 2 2 1 2 2 2 1 1 2 2 1 2 1 1 1 2 2 2 2 1 1 2
##  [297] 2 1 1 1 1 1 2 1 1 2 2 1 2 2 2 1 1 1 1 2 1 2 1 2 2 1 1 1 2 1 2 1 1 2 1 2 2
##  [334] 2 1 2 1 1 2 2 2 2 1 1 1 2 1 1 2 1 2 1 1 2 2 2 2 1 1 2 1 1 2 1 1 2 2 1 2 2
##  [371] 2 1 2 1 1 1 2 1 1 2 1 1 1 2 2 2 2 1 1 1 1 2 1 2 1 2 2 2 1 2 2 2 2 2 2 2 2
##  [408] 2 2 1 2 2 1 1 2 1 1 1 1 1 2 2 1 2 2 1 2 1 1 2 1 1 2 1 1 2 1 2 2 2 2 1 1 2
##  [445] 2 1 2 2 1 1 1 2 1 1 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 1 1 2 2 2 2 2
##  [482] 1 1 2 1 2 2 2 1 1 2 1 2 2 2 2 1 1 1 2 2 2 2 1 1 1 2 1 2 1 2 2 2 2 2 2 1 1
##  [519] 1 1 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 1 1 2 2 1 2 2 2 1 1 1 1 2 2 1 2 1 1 2 2
##  [556] 2 2 2 1 1 2 1 2 2 2 2 1 2 1 1 1 2 1 1 2 1 1 1 2 1 2 2 2 1 2 1 2 2 1 1 2 1
##  [593] 1 1 2 1 2 1 2 1 2 2 2 2 1 2 2 2 2 2 1 2 1 2 2 1 1 2 1 1 1 2 2 1 2 1 1 2 1
##  [630] 2 1 1 2 2 1 2 2 2 1 1 2 2 2 2 1 1 2 1 1 2 1 2 1 2 1 1 2 2 2 1 2 1 1 1 2 2
##  [667] 2 2 2 1 1 1 1 2 2 2 1 1 1 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 1 2 2 2 2 1 2 1 2
##  [704] 2 1 2 1 1 2 1 1 2 2 1 2 1 2 2 1 2 2 2 2 1 1 2 2 1 1 2 2 1 1 1 2 2 2 2 1 1
##  [741] 1 2 2 1 1 2 1 1 1 1 2 2 2 2 1 2 2 1 1 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1
##  [778] 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1
##  [815] 2 1 1 2 2 1 2 2 1 2 1 2 2 1 2 2 1 2 2 2 1 2 2 1 2 2 2 1 1 2 1 2 2 2 1 2 2
##  [852] 1 2 2 1 2 2 1 1 2 1 2 1 2 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 2 2 1 2 2 2 1 1
##  [889] 2 1 2 1 2 2 1 2 1 2 2 2 2 2 2 2 1 1 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2
##  [926] 2 2 1 1 2 2 1 2 1 2 2 2 2 1 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1
##  [963] 2 2 1 1 2 1 2 2 2 2 1 1 1 2 2 2 2 2 2 1 1 2 2 1 2 2 1 2 2 1 2 1 2 2 1 2 2
## [1000] 1
## 
## Within cluster sum of squares by cluster:
## [1] 1356.678 2500.634
##  (between_SS / total_SS =  22.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
table(r_K2$cluster)  # cluster sizes for K = 2
## 
##   1   2 
## 396 604
maj_cl = which.max(table(r_K2$cluster))     # majority cluster, determined by code
ord = order(r_K2$cluster != maj_cl)         # majority cluster plotted first
pairs(phoneme[ord, -6],
      col = c(2, 4)[(r_K2$cluster[ord] != maj_cl) + 1],
      pch = c(1, 3)[(r_K2$cluster[ord] != maj_cl) + 1])

8. One might be curious to see if the clusters found by a clustering method are anywhere similar to the class labels that are also available in a data set (i.e., to examine the performance of an unsupervised learning method for a supervised learning problem).

Compute the adjusted Rand indices for K=2,…,9 clusters as found by the K-means method, when contrasted with the given class labels. Comment on the results.

set.seed(769)

ari = double(8)

for(k in 2:9) {
  r = kmeans(phoneme2[, -6], centers=k)
  ari[k-1] = adjustedRandIndex(phoneme2$Class, r$cluster)
}

ari
## [1] 0.20997134 0.11711319 0.16520438 0.09069357 0.06728827 0.04566082 0.04665725
## [8] 0.04944605
  • The adjusted Rand index is such a measure: it is computed by evaluating, for each pair of observations, whether the two observations are placed in the same cluster in one partition and in the same cluster in the other partition.
  • Its value is at most 1, where 1 means a perfect agreement between the two partitions and 0 is the expected value for a random allocation (small negative values are possible).
  • The adjusted Rand indices for K=2,…,9 clusters generally decrease from about 0.21 to 0.05, so smaller K agrees better with the actual class labels. The largest value, about 0.21 at K=2, is still not very high agreement with the true classes.
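As a quick sanity check of the index itself, a toy sketch with `adjustedRandIndex` (the labels here are made up for illustration):

```r
library(mclust)

truth <- rep(1:2, each = 50)
# identical partitions give ARI = 1, even with the labels swapped
adjustedRandIndex(truth, 3 - truth)        # 1
# a random shuffle of the labels gives an ARI near 0
set.seed(769)
adjustedRandIndex(truth, sample(truth))
```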

Mixture-based clustering

9. Using the varying variance subfamily of multivariate normal mixtures, find the clustering results with two clusters, and show the results in a pairwise plot as in Task 7.

(r_M2 = Mclust(phoneme[,-6], G=2, modelNames="VVV"))
## 'Mclust' model object: (VVV,2) 
## 
## Available components: 
##  [1] "call"           "data"           "modelName"      "n"             
##  [5] "d"              "G"              "BIC"            "loglik"        
##  [9] "df"             "bic"            "icl"            "hypvol"        
## [13] "parameters"     "z"              "classification" "uncertainty"
p_M2 <- predict(r_M2)$classification             # clustering 

table(p_M2)  # 1=Oral, 2=Nasal
## p_M2
##   1   2 
## 329 671
maj_cl = which.max(table(p_M2))             # majority cluster, determined by code
ord = order(p_M2 != maj_cl)                 # majority cluster plotted first
pairs(phoneme[ord, -6],
      col = c(2, 4)[(p_M2[ord] != maj_cl) + 1],
      pch = c(1, 3)[(p_M2[ord] != maj_cl) + 1])

Also compute the adjusted Rand indices for K=2,…,9 clusters (as done in Task 8).

set.seed(769)

ari = double(8)

phoneme3 = phoneme
phoneme3$Class <- ifelse(phoneme3$Class == "Oral", as.integer(1), as.integer(2))
head(phoneme3)
##        V1      V2      V3      V4      V5 Class
## 1  1.7494 -0.0504 -0.4321  0.7681 -0.5974     2
## 2  1.9682 -0.8207 -0.2183  0.0900  0.4641     2
## 3 -0.1535 -0.1977  1.1861  0.1394 -0.6372     1
## 4 -0.3297 -0.2271  0.3609 -1.5044 -1.9235     1
## 5  2.6405 -0.7292 -1.0152 -0.3963 -0.2138     2
## 6 -0.3896 -0.1427  1.1188  0.7823  1.6400     1
for(k in 2:9) {
  r = Mclust(phoneme3[, -6], G=k, modelNames="VVV")
  ari[k-1] = adjustedRandIndex(phoneme3$Class, predict(r)$classification)
}

ari
## [1] 0.02963741 0.05014537 0.07091669 0.05169660 0.03658624 0.03973634 0.03236527
## [8] 0.03572984
  • The adjusted Rand indices for K=2,…,9 clusters range from about 0.03 to 0.07, with the largest value (0.07) at K=4. Even at its best, this is very low agreement with the true class labels, so the mixture-based clusters bear little resemblance to the classes.

Hierarchical Clustering

10. Produce a dendrogram of the hierarchical clustering results using the complete and the single linkage method, respectively.

d = dist(phoneme[,-6])          # pairwise Euclidean distances
class(d)
## [1] "dist"
head(d)
## [1] 1.507829 2.580338 3.450439 1.760060 3.463422 3.110983

Complete linkage

cex = 1.5
r_C = hclust(d)                # complete linkage, by default
names(r_C)
## [1] "merge"       "height"      "order"       "labels"      "method"     
## [6] "call"        "dist.method"
plot(r_C, cex.axis=cex, cex.lab=cex, cex.main=cex)       # dendrogram

Single linkage

r_S = hclust(d, method="single")                # single linkage
names(r_S)
## [1] "merge"       "height"      "order"       "labels"      "method"     
## [6] "call"        "dist.method"
plot(r_S, cex.axis=cex, cex.lab=cex, cex.main=cex)       # dendrogram

Explain why they look very much different.

Complete linkage:
  • uses the largest pairwise dissimilarity between observations in the two clusters;
  • tends to produce compact clusters;
  • some observations in a cluster may end up closer to observations in another cluster.

Single linkage:
  • uses the smallest pairwise dissimilarity between observations in the two clusters;
  • close, neighbouring observations are linked first;
  • may produce extended, chain-shaped clusters, with almost no gaps between successive observations.

In this case the single-linkage dendrogram is far more skewed, with most merges occurring at similar, low heights. This is because single linkage merges clusters at the smallest pairwise dissimilarity, chaining observations together one at a time, whereas complete linkage merges at the largest pairwise dissimilarity, so its merge heights grow with the cluster diameters and produce a taller, more balanced dendrogram.
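The difference in merge heights can be illustrated on a small simulated data set (a sketch; the exact values depend on the seed):

```r
# Single-linkage merge heights are nearest-neighbour distances and stay
# small, while complete-linkage heights grow with cluster diameters --
# hence the much taller, more balanced complete-linkage dendrogram
set.seed(769)
x <- matrix(rnorm(100), ncol = 2)
range(hclust(dist(x), method = "single")$height)
range(hclust(dist(x))$height)    # complete linkage: much larger top height
```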

11. Compute the adjusted Rand indices for K=2,…,9 clusters produced by the complete and the single linkage method, respectively (as done in Task 8).

p_C2 <- cutree(r_C, 2)                 # cluster labels of observations for K = 2

table(p_C2)  # cluster sizes for K = 2
## p_C2
##   1   2 
## 825 175

Complete linkage

set.seed(769)

ari = double(8)

r = hclust(d)                # complete linkage, by default
for(k in 2:9) {
  ari[k-1] = adjustedRandIndex(phoneme2$Class, cutree(r, k))
}

ari
## [1] 0.15231471 0.08742364 0.04782010 0.06935799 0.07016402 0.07475356 0.08060478
## [8] 0.08153396
  • For complete linkage: the adjusted Rand indices for K=2,…,9 clusters range from about 0.05 to 0.15, with the largest value (0.15) at K=2. Even the best agreement with the true class labels is therefore quite low.
p_S2 <- cutree(r_S, 2)                 # cluster labels of observations for K = 2

table(p_S2)  # cluster sizes for K = 2
## p_S2
##   1   2 
## 999   1

Single linkage

set.seed(769)

ari = double(8)

r = hclust(d, method="single")         # single linkage
for(k in 2:9) {
  ari[k-1] = adjustedRandIndex(phoneme2$Class, cutree(r, k))
}

ari
## [1] -0.0011781335  0.0005002740 -0.0006730085  0.0021753775  0.0050229980
## [6]  0.0038415529  0.0083236625  0.0143949554
  • For single linkage: the adjusted Rand indices for K=2,…,9 clusters are all approximately 0, i.e., no better than a random allocation. This is unsurprising, since single linkage here mostly splits off individual outlying observations (for K=2, one cluster contains 999 of the 1000 observations).

12. Produce the heatmaps for both the complete and single linkage methods.

Complete linkage

heatmap(as.matrix(phoneme[, -6]), scale="column", distfun=dist, hclustfun=hclust,
        margins=c(3,4))

Single linkage

heatmap(as.matrix(phoneme[, -6]), scale="column", distfun=dist,
        hclustfun=function(x) hclust(x, method="single"), margins=c(3,4))

Summary

In this lab, we explored the Phoneme data set from OpenML, the description of which can be found here: https://www.openml.org/d/1489.

The response variable is Class, and we used different unsupervised learning methods to examine how well its two classes can be recovered.

Reflect on the course topic that the lab is built around; what have we used, why, and how successful were we?

We gained experience with unsupervised learning problems, some basic density estimation and clustering methods, and performance evaluation.

Specifically, we applied:

  1. Density estimation
  • Univariate density estimation
  • Multivariate density estimation
  2. Clustering methods
  • K-means
  • Mixture-based clustering
  • Hierarchical clustering

We also successfully applied visualisation techniques to present the clusters.